According to Epsilon research, 80% of customers are more likely to do business with a company if it provides personalized service. Banking is no exception. The digitalization of everyday lives means that customers expect services to be delivered in a personalized and timely manner, often before they have even realized they need the service.
Santander Group aims to go a step beyond providing a customer with a financial service: it intends to predict the amount, or value, of the customer’s transactions. The primary focus is on Digital Delivery of Financial Services, reinforcing distribution and the overall customer experience in the new digital environment. The company strives to achieve this by using Big Data analytics platforms to leverage financial and non-financial information. This means anticipating customer needs in a more concrete, but also simple and personal, way.
With so many choices for financial services, this need is greater now than ever before. Predicting transaction value is the first step Santander strives to nail in order to personalize its services at scale.
A US bank used machine learning to study the discounts its private bankers were offering to customers. Bankers claimed that they offered discounts only to their most valuable customers and more than made up for them with other, high-margin business. The analytics showed something different: patterns of unnecessary discounts that could easily be corrected. After the unit adopted the changes, revenues rose by 8% within a few months.
A top consumer bank in Asia enjoyed a large market share but lagged behind its competitors in products per customer. It used advanced analytics to explore several sets of big data: customer demographics and key characteristics, products held, credit-card statements, transaction and point-of-sale data, online and mobile transfers and payments, and credit-bureau data. The bank discovered unsuspected similarities that allowed it to define 15,000 microsegments in its customer base. It then built a next-product-to-buy model that increased the likelihood of buying three times over.
Project Objectives:
Santander Group wants to predict the value of future customer transactions (the target column) in the test set with minimal error. The evaluation metric for this project is Root Mean Squared Logarithmic Error (RMSLE).
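Since RMSLE is the yardstick for every model that follows, a minimal R implementation is useful for local validation. This is our own helper, not part of the provided files:

```r
# RMSLE: square root of the mean squared difference of log-transformed values.
# log1p(x) = log(1 + x) handles zero-valued predictions safely.
rmsle <- function(actual, predicted) {
  sqrt(mean((log1p(predicted) - log1p(actual))^2))
}

rmsle(c(100, 200), c(100, 200))  # 0 for a perfect prediction
```

Because the metric operates on logs, a prediction that is off by a fixed ratio is penalized the same at any dollar scale, which fits a target spanning several orders of magnitude.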
To solve this challenge, we plan to follow the CRISP-DM outline:
We are provided with an anonymized dataset containing numeric feature variables, the numeric target column, and a string ID column.
File descriptions:
transaction.data <- read.csv(file = "train.csv", header = TRUE, sep = ",")
test_non_zero_base <- read.csv("test.csv", header = TRUE, sep = ",")
# Caution: attaching both frames masks shared column names (e.g. ID);
# target exists only in the training data, so bare references to target
# below resolve to transaction.data.
attach(transaction.data)
attach(test_non_zero_base)
options("scipen" = 999, "digits" = 10)
str(transaction.data, list.len = 10, vec.len = 5)
## 'data.frame': 4459 obs. of 4993 variables:
## $ ID : Factor w/ 4459 levels "000d6aaf2","000fbd867",..: 1 2 3 4 5 6 7 8 9 10 11 12 ...
## $ target : num 38000000 600000 10000000 2000000 14400000 2800000 164000 600000 979000 460000 1100000 16000000 ...
## $ X48df886f9: num 0 0 0 0 0 0 0 0 0 0 0 0 ...
## $ X0deb4b6a8: int 0 0 0 0 0 0 0 0 0 0 0 0 ...
## $ X34b15f335: num 0 0 0 0 0 0 0 0 0 0 0 0 ...
## $ a8cb14b00 : int 0 0 0 0 0 0 0 0 0 0 0 0 ...
## $ X2f0771a37: int 0 0 0 0 0 0 0 0 0 0 0 0 ...
## $ X30347e683: int 0 0 0 0 0 0 0 0 0 0 0 0 ...
## $ d08d1fbe3 : int 0 0 0 0 0 0 0 0 0 0 0 0 ...
## $ X6ee66e115: int 0 0 0 0 0 0 0 0 0 0 0 0 ...
## [list output truncated]
summary <- summary.data.frame(transaction.data)
summary[1:6, 1:10]
## ID target X48df886f9
## 000d6aaf2: 1 Min. : 30000 Min. : 0.00
## 000fbd867: 1 1st Qu.: 600000 1st Qu.: 0.00
## 0027d6b71: 1 Median : 2260000 Median : 0.00
## 0028cbf45: 1 Mean : 5944923 Mean : 14654.93
## 002a68644: 1 3rd Qu.: 8000000 3rd Qu.: 0.00
## 002dbeb22: 1 Max. :40000000 Max. :20000000.00
## X0deb4b6a8 X34b15f335 a8cb14b00
## Min. : 0.000 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 0.000
## Median : 0.000 Median : 0.00 Median : 0.000
## Mean : 1390.895 Mean : 26722.45 Mean : 4530.164
## 3rd Qu.: 0.000 3rd Qu.: 0.00 3rd Qu.: 0.000
## Max. :4000000.000 Max. :20000000.00 Max. :14800000.000
## X2f0771a37 X30347e683 d08d1fbe3
## Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 0.00 Median : 0.00 Median : 0.00
## Mean : 26409.96 Mean : 30708.11 Mean : 16865.22
## 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.00
## Max. :100000000.00 Max. :20708000.00 Max. :40000000.00
## X6ee66e115
## Min. : 0.000
## 1st Qu.: 0.000
## Median : 0.000
## Mean : 4669.208
## 3rd Qu.: 0.000
## Max. :10400000.000
Preliminary observations:
Time series nature - the dataset appears to be a time series in both dimensions, row-wise and column-wise.
Disguised meaning of the columns - each column seems to represent individual transaction amounts, possibly of different transaction types.
library(DataExplorer)
library(ggplot2)
library(data.table)
library(dplyr)
library(plotly)
library(e1071)
library(tidyr)
library(purrr)
library(compare)
library(tidyverse)
library(caret)
library(leaps)
library(MASS)
library(mltools)
library(rpart)
library(arules)
library(arulesViz)
First, we want to assess the data quality in terms of missing values and take a closer look at the target variable and its distribution.
#plot_missing(transaction.data)
#transaction.data[!complete.cases(transaction.data),]
#sapply(transaction.data, function(x) sum(is.na(x)))
#Due to the size of the data set, commands above are difficult to print in the report
sum(is.na(transaction.data))
## [1] 0
ggplot(transaction.data,aes(x=target))+geom_histogram(fill="blue",bins=50)+scale_x_continuous(trans='log2')+ggtitle("Histogram Distribution of Target")
box_plot <- ggplot(transaction.data, aes(y= target)) +
geom_boxplot() +
ylab("Target") +
scale_y_continuous(trans='log2')+
ggtitle("Box Plot of Target")
box_plot
qqnorm(transaction.data$target,
datax = TRUE,
col = "red",
main = "Normal Q-Q Plot of target Distribution")
qqline(transaction.data$target,
col = "blue",
datax = TRUE)
(min(target))
## [1] 30000
(max(target))
## [1] 40000000
(target_lcutoff <- quantile(target,.25))
## 25%
## 600000
(target_ucutoff <- quantile(target,.75))
## 75%
## 8000000
(median(target))
## [1] 2260000
(mean(target))
## [1] 5944923.322
As we can see, there are no missing values. The target variable is not normally distributed and has several outliers that we will need to pay attention to during the Data Preparation stage. The mean is higher than the median, so the distribution is right-skewed. Also, looking at the min and max, the range is very wide.
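The mean-above-median pattern is the classic signature of right skew; a small self-contained check on synthetic data (not the competition set) illustrates the relationship:

```r
set.seed(42)
x <- rexp(1000)                      # exponential draws are right-skewed
m <- mean(x)
skew <- mean((x - m)^3) / sd(x)^3    # moment-based sample skewness
m > median(x)                        # TRUE: mean pulled up by the long tail
skew > 0                             # TRUE: positive skew
```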
Next, we dig deeper into the preliminary observations from the previous section. The broader hypothesis that we analyze is: columns and rows were originally time ordered and then shuffled for the competition.
num_rows / days_in_week : 4459 / 7 = 637
num_cols / days_in_week : 4991 / 7 = 713 (4,991 is the 4,993 columns minus ID and target)
Both counts divide evenly by seven, which serves as an additional point in support of the hypothesis that the data represents weekly transactional activity. Based on other observations, this dataset does not seem to contain any aggregate features.
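The arithmetic behind the weekly hypothesis is easy to verify in place (counts taken from the str() output above, with ID and target excluded from the column count):

```r
n_rows <- 4459           # observations in train.csv
n_feature_cols <- 4991   # 4993 columns minus ID and target
n_rows %% 7 == 0         # TRUE: rows split evenly into weeks
n_feature_cols %% 7 == 0 # TRUE: so do the feature columns
```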
To prepare for modeling, we want to better understand the meaning of columns & rows. Further, we evaluate whether all the data is truly significant for our analysis. The key criterion is the number of zeros vs. the number of unique values.
tran.data.zero<-data.table(transaction.data)
n_zeros <- tran.data.zero[, lapply(.SD, function(x) sum(x == 0) / length(x))] %>% unlist
a <-list(
autotick = FALSE,
ticks = "outside",
tick0 = 0.6,
dtick = 0.1,
range = c(0.6, 1),
ticklen = 5,
tickwidth = 2,
tickcolor = toRGB("blue")
)
plot_ly(x = ~n_zeros, type = "histogram",
marker = list(color = "dodgerblue")) %>%
layout(xaxis = a, title = "Histogram of % of zeros in dataset", titlefont = list(color = '#000000', size = 20), margin = list(l = 50, t=40))
Source: Kaggle
As a start, we select the subset of the training data where columns and rows have more than 1000 non-zero values.
x<-colSums(transaction.data != 0)
y<-colnames(transaction.data)
x_name<-"Count"
y_name<-"Col_name"
Train_nz<- data.frame(x, y)
colnames(Train_nz) <- c(x_name, y_name)
#Include columns with non_zero values greater than 1000
Subset1<-Train_nz[Train_nz$Count>1000,]
Subset1$Col_name<-as.character(Subset1$Col_name)
#head(Subset1$Col_name)
#str(Subset1$Col_name)
train_non_zero<-transaction.data[Subset1$Col_name]
#head(train_non_zero,3)
w<-rowSums(transaction.data != 0)
t<-rownames(transaction.data)
w_name<-"Count"
t_name<-"Row_name"
Train_nz2<- data.frame(w, t)
colnames(Train_nz2) <- c(w_name, t_name)
#head(Train_nz2)
#Include rows with non_zero values greater than 1000
Subset1a<-Train_nz2[Train_nz2$Count>1000,]
Subset1a$Row_name<-as.character(Subset1a$Row_name)
#head(Subset1a$Row_name)
#str(Subset1a$Row_name)
train_non_zero<-train_non_zero[Subset1a$Row_name,]
head(train_non_zero,3)
## ID target X20aa07010 X963a49cdc X26fc93eb7 X0572565c2
## 115 06cd03d09 3095200.00 4005500 3420000 8260400.00 40000.00
## 128 077787b5d 10814666.66 3420000 25300000 2253333.34 2300000.00
## 181 0a6e926ca 11973333.34 14145000 11115200 547000.00 16733333.34
## X66ace2992 fb49e4212 X6619d81fc X6eef030c1 fc99f9426 X1db387535
## 115 0 16818666.66 3252500 1859333.34 22134000 2090000.00
## 128 25100000 22134000.00 4870000 16818666.66 0 1838333.34
## 181 819750 40000.00 0 2286500.00 2300000 356000.00
## b43a7cfd5 X024c577b9 X2ec5b290f X0ff32eb98 X58e056e12 X241f0f867
## 115 282333.34 1554666.66 3012000 547000 3862000.0 10491200
## 128 1448888.88 8260400.00 3252500 370000 147090.9 2090000
## 181 450000.00 1448888.88 22134000 908000 8260400.0 25100000
## X1931ccfdd X58e2e02e6 X9fd594eec fb0f5dbfe X91f701ba2 X703885424
## 115 14145000 0.00 7279000 111791333.3 25100000 1448888.88
## 128 11115200 5352666.66 10491200 2286500.0 356000 547000.00
## 181 707000 10491200.00 0 3420000.0 922000 400000.00
## eeb9cd3aa X324921c7b X58232a6fb X491b9ee45 d6bb78916 X70feb1494
## 115 1060000.00 11973333.34 10814666.66 450000 42000 2190000
## 128 3012000.00 23985000.00 2190000.00 400000 14145000 707000
## 181 16818666.66 3252500.00 23985000.00 600000 2190000 0
## adb64ff71 X62e59a501 X15ace8c9f X5c6487af1 f190486d6 f74e8f13d
## 115 25300000.00 5352666.66 972000.0 147090.90 3530500.00 2286500
## 128 16733333.34 30000000.00 0.0 819750.00 11973333.34 40000
## 181 374000.00 2090000.00 147090.9 2253333.34 3012000.00 25300000
## c5a231d81 e176a204a X1702b5bf0 X190db8488 c47340d97 X23310aa6f
## 115 11115200 30000000.00 3056888.88 819750 2253333.34 23985000
## 128 374000 600000.00 450000.00 922000 4900000.00 0
## 181 430000 1838333.34 30000000.00 4900000 370000.00 4870000
write.csv(train_non_zero, file = "train_non_zero.csv",row.names=FALSE)
This approach allows us to identify ~40 variables and ~80 observations that appear to be the most impactful for the target variable. We also added a column with the mean value for each row.
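The row-mean column mentioned above can be appended with rowMeans(). The sketch below uses a toy frame in place of train_non_zero, and the column name row_mean is our own choice:

```r
# Toy stand-in for train_non_zero (the real subset is built above)
df <- data.frame(ID = c("a", "b"), target = c(10, 20),
                 x1 = c(2, 4), x2 = c(6, 8))
# Mean over the predictor columns only, excluding ID and target
df$row_mean <- rowMeans(df[, !(names(df) %in% c("ID", "target"))])
df$row_mean  # 4 6
```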
Proceeding further, more advanced algorithms could be used to detect the patterns between columns and rows. The subset below is the result of a mix of feature importance, sorting columns and rows by sum of non-zeros, and correlation plus RMSE between columns.
trans.data<-fread("train.csv", header = TRUE)
training.data <- trans.data[, c("ID","target","f190486d6", "58e2e02e6", "eeb9cd3aa", "9fd594eec", "6eef030c1", "15ace8c9f", "fb0f5dbfe", "58e056e12", "20aa07010", "024c577b9", "d6bb78916", "b43a7cfd5", "58232a6fb", "1702b5bf0", "324921c7b", "62e59a501", "2ec5b290f", "241f0f867", "fb49e4212", "66ace2992", "f74e8f13d", "5c6487af1", "963a49cdc", "26fc93eb7", "1931ccfdd", "703885424", "70feb1494", "491b9ee45", "23310aa6f", "e176a204a", "6619d81fc", "1db387535", "fc99f9426", "91f701ba2", "0572565c2", "190db8488", "adb64ff71", "c47340d97", "c5a231d81", "0ff32eb98"), with = F]
training.data <- training.data[c(1757, 3809, 511, 3798, 625, 3303, 4095, 1283, 4209, 1696, 3511, 816, 245, 1383, 2071, 3492, 378, 2971, 2366, 4414, 2790, 3979, 193, 1189, 3516, 810, 4443, 3697, 235, 1382, 4384, 3418, 4396, 921, 3176, 650),]
head(training.data,5)
## ID target f190486d6 58e2e02e6 eeb9cd3aa 9fd594eec 6eef030c1
## 1: 67239ac88 400000 0 0 1470000 0 1400000
## 2: d941e0d1a 220000 50000 130000 0 20000 0
## 3: 1dedfdbac 22000000 18500000 22500000 25000000 32500000 25000000
## 4: d8dc14291 230000 0 0 0 0 0
## 5: 24005e477 1100000 0 0 0 0 0
## 15ace8c9f fb0f5dbfe 58e056e12 20aa07010 024c577b9 d6bb78916
## 1: 0 0 0 0 1333333.34 0.00
## 2: 0 0 0 120000 70000.00 0.00
## 3: 22500000 30000000 25000000 34000000 24000000.00 23333333.34
## 4: 0 0 0 0 0.00 0.00
## 5: 0 0 0 0 0.00 0.00
## b43a7cfd5 58232a6fb 1702b5bf0 324921c7b 62e59a501 2ec5b290f 241f0f867
## 1: 0.00 0 0 0 0 0 0
## 2: 90000.00 0 0 10000 0 0 0
## 3: 16666666.66 30000000 20000000 42500000 0 24000000 26250000
## 4: 0.00 0 0 0 100000 0 0
## 5: 0.00 0 0 0 0 0 0
## fb49e4212 66ace2992 f74e8f13d 5c6487af1 963a49cdc 26fc93eb7
## 1: 0 0 0 0 0 557333.34
## 2: 2000000 0 0 0 0 0.00
## 3: 30000000 0 32000000 28000000 30000000 27500000.00
## 4: 0 0 0 0 0 0.00
## 5: 0 0 1800000 0 0 0.00
## 1931ccfdd 703885424 70feb1494 491b9ee45 23310aa6f e176a204a
## 1: 557333.34 557333.34 557333.34 557333.34 557333.34 557333.34
## 2: 262000.00 0.00 0.00 0.00 0.00 0.00
## 3: 26000000.00 30000000.00 45714285.72 50000000.00 33333333.34 50000000.00
## 4: 0.00 0.00 0.00 0.00 0.00 0.00
## 5: 0.00 0.00 0.00 0.00 0.00 0.00
## 6619d81fc 1db387535 fc99f9426 91f701ba2 0572565c2 190db8488
## 1: 557333.34 557333.34 557333.34 557333.34 557333.34 557333.34
## 2: 0.00 0.00 0.00 0.00 20000.00 0.00
## 3: 48333333.34 70591666.66 50000000.00 40000000.00 0.00 48333333.34
## 4: 260000.00 0.00 0.00 30000.00 0.00 0.00
## 5: 0.00 0.00 0.00 0.00 0.00 0.00
## adb64ff71 c47340d97 c5a231d81 0ff32eb98
## 1: 557333.34 557333.34 557333.34 557333.34
## 2: 0.00 20000.00 0.00 0.00
## 3: 50000000.00 50000000.00 44285714.28 37000000.00
## 4: 0.00 0.00 0.00 0.00
## 5: 0.00 0.00 0.00 0.00
We could also consider Principal Component Analysis to further group the variables.
train2 <- subset(train_non_zero, select = -c(target, ID))
# Note: prcomp centers the data by default but does not scale it, so the
# components here are dominated by the highest-variance dollar columns.
pc <- prcomp(train2)
summary(pc)
## Importance of components:
## PC1 PC2 PC3
## Standard deviation 20475545.71431 16281704.14251 16049859.15688
## Proportion of Variance 0.11416 0.07219 0.07015
## Cumulative Proportion 0.11416 0.18635 0.25650
## PC4 PC5 PC6
## Standard deviation 15171346.42217 15061785.18934 14678517.66934
## Proportion of Variance 0.06268 0.06177 0.05867
## Cumulative Proportion 0.31917 0.38095 0.43962
## PC7 PC8 PC9
## Standard deviation 14294331.29585 13978808.53868 13738964.85435
## Proportion of Variance 0.05564 0.05321 0.05140
## Cumulative Proportion 0.49526 0.54847 0.59987
## PC10 PC11 PC12
## Standard deviation 13488806.68764 13346625.69524 13236441.35441
## Proportion of Variance 0.04955 0.04851 0.04771
## Cumulative Proportion 0.64941 0.69792 0.74563
## PC13 PC14 PC15
## Standard deviation 12726005.90218 12164146.59667 12022579.39591
## Proportion of Variance 0.04410 0.04029 0.03936
## Cumulative Proportion 0.78973 0.83002 0.86938
## PC16 PC17 PC18
## Standard deviation 8883204.77693 7560874.57813 6737724.91004
## Proportion of Variance 0.02149 0.01557 0.01236
## Cumulative Proportion 0.89087 0.90644 0.91880
## PC19 PC20 PC21
## Standard deviation 6393197.31967 6204817.93706 6028077.42601
## Proportion of Variance 0.01113 0.01048 0.00989
## Cumulative Proportion 0.92993 0.94041 0.95031
## PC22 PC23 PC24
## Standard deviation 5272242.41257 4987742.62709 4836308.49398
## Proportion of Variance 0.00757 0.00677 0.00637
## Cumulative Proportion 0.95788 0.96465 0.97102
## PC25 PC26 PC27
## Standard deviation 4625934.22936 3869723.65323 3627766.20574
## Proportion of Variance 0.00583 0.00408 0.00358
## Cumulative Proportion 0.97685 0.98092 0.98451
## PC28 PC29 PC30
## Standard deviation 3346119.27742 3192193.95793 3042755.02716
## Proportion of Variance 0.00305 0.00277 0.00252
## Cumulative Proportion 0.98756 0.99033 0.99285
## PC31 PC32 PC33
## Standard deviation 2594330.73785 2250931.94054 2053204.15570
## Proportion of Variance 0.00183 0.00138 0.00115
## Cumulative Proportion 0.99469 0.99607 0.99721
## PC34 PC35 PC36
## Standard deviation 1832187.49365 1529464.92610 1194272.86586
## Proportion of Variance 0.00091 0.00064 0.00039
## Cumulative Proportion 0.99813 0.99877 0.99915
## PC37 PC38 PC39
## Standard deviation 1144744.86569 914233.73949 810061.91687
## Proportion of Variance 0.00036 0.00023 0.00018
## Cumulative Proportion 0.99951 0.99974 0.99992
## PC40
## Standard deviation 553518.00408
## Proportion of Variance 0.00008
## Cumulative Proportion 1.00000
#plot(pc)
plot(pc,type="l")
biplot(pc)
#attributes(pc)
The principal component analysis yields 40 components, with the variance spread across many of them (15 components are needed to reach ~87% cumulative variance), so dimensionality reduction offers limited benefit here. Therefore, we proceed with the train_non_zero subset.
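A common follow-up is to count how many components are needed to reach a cumulative-variance threshold. Illustrated here on the built-in USArrests data rather than the competition subset:

```r
# PCA on built-in data, scaled so each variable contributes equally
pc_demo <- prcomp(USArrests, scale. = TRUE)
# Cumulative proportion of variance explained, component by component
cumvar <- cumsum(pc_demo$sdev^2) / sum(pc_demo$sdev^2)
which(cumvar >= 0.90)[1]  # first component count covering >= 90% of variance
```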
Once the desired subset is selected, we analyze its structure and identify the necessary elements for the data preparation process.
str(train_non_zero, list.len = 10, vec.len = 5)
## 'data.frame': 81 obs. of 42 variables:
## $ ID : Factor w/ 4459 levels "000d6aaf2","000fbd867",..: 109 121 173 186 207 228 238 373 491 505 515 622 ...
## $ target : num 3095200 10814667 11973333 500000 5187333 406000 ...
## $ X20aa07010: num 4005500 3420000 14145000 0 972000 5450500 ...
## $ X963a49cdc: num 3420000 25300000 11115200 115636 0 440000 ...
## $ X26fc93eb7: num 8260400 2253333 547000 540000 2286500 3076667 ...
## $ X0572565c2: num 40000 2300000 16733333 1563412 2090000 1563412 ...
## $ X66ace2992: num 0 25100000 819750 247167 3012000 466462 ...
## $ fb49e4212 : num 16818667 22134000 40000 550000 5352667 3147200 ...
## $ X6619d81fc: num 3252500 4870000 0 1563412 450000 1015000 ...
## $ X6eef030c1: num 1859333 16818667 2286500 2352552 0 500000 ...
## [list output truncated]
summary.subset <- summary.data.frame(train_non_zero)
summary.subset[1:6, 1:10]
## ID target X20aa07010 X963a49cdc
## 06cd03d09: 1 Min. : 42000 Min. : 0 Min. : 0
## 077787b5d: 1 1st Qu.: 406000 1st Qu.: 440000 1st Qu.: 450000
## 0a6e926ca: 1 Median : 1304800 Median : 1563412 Median : 1563412
## 0b132f2c6: 1 Mean : 3215726 Mean : 5035783 Mean : 3871480
## 0cba9fad9: 1 3rd Qu.: 3244909 3rd Qu.: 3513333 3rd Qu.: 3012000
## 0e1920aa8: 1 Max. :35050000 Max. :111791333 Max. :30000000
## X26fc93eb7 X0572565c2 X66ace2992
## Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 547000 1st Qu.: 550000 1st Qu.: 406000
## Median : 1563412 Median : 1563412 Median : 1563412
## Mean : 3860970 Mean : 3255003 Mean : 3897239
## 3rd Qu.: 3012000 3rd Qu.: 1600000 3rd Qu.: 3076667
## Max. :30000000 Max. :30000000 Max. :30000000
## fb49e4212 X6619d81fc X6eef030c1
## Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 450000 1st Qu.: 600000 1st Qu.: 307000
## Median : 1563412 Median : 1563412 Median : 1060000
## Mean : 3707250 Mean : 3385029 Mean : 4765962
## 3rd Qu.: 3147200 3rd Qu.: 1600000 3rd Qu.: 3147200
## Max. :30000000 Max. :35050000 Max. :111791333
plot_histogram(train_non_zero)
plot_correlation(train_non_zero,type="continuous")
Note: the provided test set does not contain the target variable, so our evaluation of the models will have to be based on the training set. However, we still apply all the necessary data preparation steps to the test set to showcase the process.
valid_colnames<-colnames(train_non_zero)
valid_colnames<-valid_colnames[1:42]
valid_ID<-as.character(train_non_zero$ID)
train_valid_names<-names(transaction.data)[names(transaction.data) %in% valid_colnames]
train_valid <-transaction.data[, train_valid_names]
f<-rowSums(transaction.data != 0)
p<-rownames(transaction.data)
f_name<-"Count"
p_name<-"Row_name"
Train_nz4<- data.frame(f, p)
colnames(Train_nz4) <- c(f_name, p_name)
#head(Train_nz4)
#Include rows with non_zero values less than 970 and more than 600
Subset1c<-Train_nz4[Train_nz4$Count<970 & Train_nz4$Count>600 ,]
Subset1c$Row_name<-as.character(Subset1c$Row_name)
#head(Subset1c$Row_name)
#str(Subset1c$Row_name)
train_row_names_valid<-rownames(transaction.data)[rownames(transaction.data) %in% Subset1c$Row_name]
train_valid<-train_valid[train_row_names_valid, ]
train_valid_scaled<-scale(train_valid[,-1])
train_valid_scaled<-data.frame(train_valid_scaled)
train_valid_scaled$ID<-train_valid$ID
train_valid_scaled<-train_valid_scaled[c(42,1:41)]
train_valid_scaled$target<-train_valid$target
N1a <- length(train_valid_scaled[,3])
nbins1a <- 5
whichbin1a <- c(rep(0, N1a))
freq1a <- N1a/nbins1a
train_valid_scaled <- train_valid_scaled[order(train_valid_scaled[,3]),]
for (i in 1:nbins1a) {
for (j in 1:N1a) {
if((i-1)*freq1a < j && j <=i*freq1a)
whichbin1a[j] <- i
}
}
whichbin1a<-gsub(pattern = "1", replacement = "VLow", whichbin1a)
whichbin1a<-gsub(pattern = "2", replacement = "Low", whichbin1a)
whichbin1a<-gsub(pattern = "3", replacement = "Medium", whichbin1a)
whichbin1a<-gsub(pattern = "4", replacement = "High", whichbin1a)
whichbin1a<-gsub(pattern = "5", replacement = "VHigh", whichbin1a)
train_valid_scaled[,3]<-whichbin1a
First, we create a similar subset from the test data, using exactly the same columns and similar non-zero rows.
#Compare test and train base files
comparison <- compare(transaction.data,test_non_zero_base,allowAll=TRUE)
comparison$tM
semi_join(transaction.data,test_non_zero_base)
It seems like there aren’t any common rows.
subset_colnames<-colnames(train_non_zero)
subset_colnames<-subset_colnames[1:42]
subset_ID<-as.character(train_non_zero$ID)
test_names<-names(test_non_zero_base)[names(test_non_zero_base) %in% subset_colnames]
test_ID<-test_non_zero_base$ID[test_non_zero_base$ID %in% subset_ID]
test_non_zero <-test_non_zero_base[, test_names]
z<-rowSums(test_non_zero_base != 0)
q<-rownames(test_non_zero_base)
z_name<-"Count"
q_name<-"Row_name"
Train_nz3<- data.frame(z, q)
colnames(Train_nz3) <- c(z_name, q_name)
#head(Train_nz3)
#Include rows with non_zero values greater than 970
Subset1b<-Train_nz3[Train_nz3$Count>970,]
Subset1b$Row_name<-as.character(Subset1b$Row_name)
#head(Subset1b$Row_name)
#str(Subset1b$Row_name)
test_row_names<-rownames(test_non_zero)[rownames(test_non_zero) %in% Subset1b$Row_name]
test_non_zero<-test_non_zero[test_row_names, ]
head(test_non_zero,3)
## ID X20aa07010 X963a49cdc X26fc93eb7 X0572565c2 X66ace2992
## 874 046868511 0 0.00 136800.00 5510800 59333.34
## 1369 07061afcb 8120000 20000000.00 340666.66 186000 380000.00
## 1584 08337f345 380000 110666.66 186000.00 656400 3840000.00
## fb49e4212 X6619d81fc X6eef030c1 fc99f9426 X1db387535 b43a7cfd5
## 874 78000.00 6060000.00 186000 4609500.00 39000.00 1427230.76
## 1369 8478666.66 3840000.00 5345600 211333.34 8583428.58 1372800.00
## 1584 1137555.56 1427230.76 4139600 237000.00 340000.00 20000000.00
## X024c577b9 X2ec5b290f X0ff32eb98 X58e056e12 X241f0f867 X1931ccfdd
## 874 221000 190800 1347166.66 115500.00 1200000.00 0.00
## 1369 11700000 2290800 221000.00 0.00 9731666.66 664000.00
## 1584 564800 1020500 15800000.00 8478666.66 970500.00 1124166.66
## X58e2e02e6 X9fd594eec fb0f5dbfe X91f701ba2 X703885424 eeb9cd3aa
## 874 8583428.58 110666.66 12686800.00 12307000.00 1000000 211333.34
## 1369 2340400.00 369200.00 2600000.00 110666.66 6206000 5600000.00
## 1584 0.00 6062500.00 9731666.66 383200.00 12686800 52932000.00
## X324921c7b X58232a6fb X491b9ee45 d6bb78916 X70feb1494 adb64ff71
## 874 383200 340000.00 1554666.66 200000 500000.00 911000
## 1369 6062500 0.00 1020500.00 22529000 525066.66 12686800
## 1584 6206000 340666.66 0.00 3761200 115500.00 1200000
## X62e59a501 X15ace8c9f X5c6487af1 f190486d6 f74e8f13d c5a231d81
## 874 656400.00 1124166.66 2456800.00 3840000.00 15800000.00 16000.00
## 1369 4139600.00 3677000.00 3761200.00 6266666.66 564800.00 0.00
## 1584 525066.66 2290800.00 211333.34 1372800.00 8583428.58 59333.34
## e176a204a X1702b5bf0 X190db8488 c47340d97 X23310aa6f
## 874 3540000.00 237000 984000.00 0 1180000
## 1369 1137555.56 52932000 1124166.66 115500 970500
## 1584 200000.00 664000 190800.00 78000 221000
write.csv(test_non_zero, file = "test_non_zero.csv",row.names=FALSE)
#Compare test and train subsets
comparison <- compare(train_non_zero,test_non_zero,allowAll=TRUE)
comparison$tM
semi_join(train_non_zero,test_non_zero)
Next, we conduct several two-sample t tests for a difference in means. The null hypothesis is that the means are equal and the partition is valid. The alternative hypothesis is that the means are significantly different and the partition is invalid. We assume a significance level of 5%.
mean1<-mean(train_non_zero[,3])
mean2<-mean(test_non_zero[,2])
sd1<-sd(train_non_zero[,3])
sd2<-sd(test_non_zero[,2])
l1<-length(train_non_zero[,3])
l2<-length(test_non_zero[,2])
dfs <- min(l1 - 1, l2 - 1)
tdata <- (mean1 - mean2) / sqrt((sd1^2/l1)+(sd2^2/l2))
pvalue <- 2*pt(abs(tdata), df = dfs, lower.tail=FALSE)  # two-sided: use |t|
tdata; pvalue
## [1] -0.8084183653
## [1] 0.421373165
Based on the test for the first predictor column, the p-value is higher than 0.05, so we don’t have enough evidence to reject the null hypothesis and the partition appears valid.
mean3<-mean(train_non_zero[,4])
mean4<-mean(test_non_zero[,3])
sd3<-sd(train_non_zero[,4])
sd4<-sd(test_non_zero[,3])
l3<-length(train_non_zero[,4])
l4<-length(test_non_zero[,3])
dfs <- min(l3 - 1, l4 - 1)
tdata1 <- (mean3 - mean4) / sqrt((sd3^2/l3)+(sd4^2/l4))
pvalue1 <- 2*pt(abs(tdata1), df = dfs, lower.tail=FALSE)  # two-sided: use |t|
tdata1; pvalue1
## [1] -1.576235198
## [1] 0.119125353
The previous conclusion is confirmed by the next variable as well, so we will assume the partition is valid for the goals of the modeling.
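The manual statistic can be cross-checked with R’s built-in Welch test; a sanity property worth keeping in mind is that a valid two-sided p-value always lies in [0, 1]. Toy vectors below, not the competition columns:

```r
set.seed(1)
a <- rnorm(50, mean = 0)
b <- rnorm(60, mean = 0.1)
tt <- t.test(a, b)            # Welch two-sample t test (unequal variances)
c(tt$statistic, tt$p.value)   # p-value is guaranteed to fall within [0, 1]
```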
To start off, we standardize both data sets using the z-score method.
train_non_zero_scaled<-scale(train_non_zero[,-1])
train_non_zero_scaled<-data.frame(train_non_zero_scaled)
train_non_zero_scaled$ID<-train_non_zero$ID
train_non_zero_scaled<-train_non_zero_scaled[c(42,1:41)]
train_non_zero_scaled$target<-train_non_zero$target
#Plot correlation
plot_correlation(train_non_zero_scaled,type="continuous")
pairs(train_non_zero_scaled[2:10])
test_non_zero_scaled<-scale(test_non_zero[,-1])
test_non_zero_scaled<-data.frame(test_non_zero_scaled)
test_non_zero_scaled$ID<-test_non_zero$ID
test_non_zero_scaled<-test_non_zero_scaled[c(41,1:40)]
#Plot correlation
plot_correlation(test_non_zero_scaled,type="continuous")
pairs(test_non_zero_scaled[2:10])
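As a quick sanity check of what scale() does, each standardized column should come out with mean ~0 and standard deviation 1:

```r
# Two columns on very different scales, z-scored by scale()
scaled_m <- scale(matrix(c(1, 2, 3, 10, 20, 30), ncol = 2))
round(colMeans(scaled_m), 10)  # 0 0
apply(scaled_m, 2, sd)         # 1 1
```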
Next, we analyze and, if necessary, remove outliers.
outliers <- function(dataframe){
dataframe %>%
select_if(is.numeric) %>%
map(~ boxplot.stats(.x)$out)
}
head(outliers(train_non_zero_scaled), 5)
## $target
## [1] 10814666.66 11973333.34 16818666.66 35050000.00 20000000.00 10491200.00
## [7] 8120000.00 16733333.34 12000000.00
##
## $X20aa07010
## [1] 0.6743469609 0.5135804809 1.4027930858 2.2219248309 0.8659588909
## [6] 0.4278054551 7.9030149593 0.4038595166 1.5001413230 0.2387154500
## [11] 0.8722760379 0.5155545891
##
## $X963a49cdc
## [1] 3.2334460526 1.0930375966 3.0350197710 3.2032671504 0.6622639473
## [6] 0.6410783580 0.9988794219 3.9426502531 1.9407830782 2.7557140316
## [11] 1.2225252071 1.9536594078 1.5502177852
##
## $X26fc93eb7
## [1] 0.6702133493 3.9820445877 3.2660421396 1.5666788825 1.0593327496
## [6] 1.9739877919 1.0100555325 2.7837307033 1.9609880331 1.1051162831
## [11] 3.0657137951 3.2355739504
##
## $X0572565c2
## [1] 2.1745625121878 0.3542163805001 0.0416784097881 0.8075591217507
## [5] 1.2681459247409 4.3149757051492 3.3445295723251 3.0458935304536
## [9] 3.5244210832310 0.2605598584545 0.2653999887928 -0.0004038356175
## [13] 3.5566886188195
Looking at the identified outliers, we do not feel we possess enough domain and client knowledge at this point to justify removing them. Therefore, we will proceed to modelling with the full standardized data set and, if necessary, make iterative adjustments based on the learnings of the next phase.
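For reference, boxplot.stats() follows the Tukey rule: anything beyond 1.5 times the hinge spread above the upper hinge or below the lower hinge is flagged. A minimal demonstration:

```r
v <- c(rep(1:10, 3), 100)  # 30 well-behaved values plus one extreme
boxplot.stats(v)$out        # only 100 lies beyond the upper fence
```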
Finally, we transform the target variable to improve its normality for modelling.
summary(train_non_zero_scaled$target)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 42000 406000 1304800 3215726 3244909 35050000
train_non_zero_scaled$log_target<-log(train_non_zero_scaled$target)
train_non_zero_scaled<-train_non_zero_scaled[c(1:2,43,3:42)]
#Namespace-qualified: the bare skewness() call returned an empty list here,
#likely masked by another loaded package
e1071::skewness(train_non_zero_scaled$log_target)
ggplot(train_non_zero_scaled,aes(x=log_target))+geom_histogram(fill="blue")+ggtitle("Histogram of Normalized Target")
Based on the learnings from modelling, we might also consider normalizing predictor variables. However, we need to keep in mind the negative impact it would have on the ability to interpret the results of, for example, linear regression. Using base dollar values makes it much easier to understand the nature of the effect that predictors have on the target.
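One practical consequence of modelling on log_target: predictions come out on the log scale, and exp() must be applied before reporting dollar values. The round trip is exact up to floating point (values below are the min, median, and max of target from the summary above):

```r
y <- c(42000, 1304800, 35050000)   # min, median, max of the subset target
all.equal(exp(log(y)), y)          # TRUE: the transform is invertible
```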
#train_non_zero_normal<-log(train_non_zero[,2:43])
For some of the numeric predictors, we see the opportunity to apply equal frequency binning.
#Binning one of the predictors
N1 <- length(train_non_zero_scaled[,4])
nbins1 <- 5
whichbin1 <- c(rep(0, N1))
freq1 <- N1/nbins1
train_non_zero_scaled <- train_non_zero_scaled[order(train_non_zero_scaled[,4]),]
for (i in 1:nbins1) {
for (j in 1:N1) {
if((i-1)*freq1 < j && j <=i*freq1)
whichbin1[j] <- i
}
}
whichbin1<-gsub(pattern = "1", replacement = "VLow", whichbin1)
whichbin1<-gsub(pattern = "2", replacement = "Low", whichbin1)
whichbin1<-gsub(pattern = "3", replacement = "Medium", whichbin1)
whichbin1<-gsub(pattern = "4", replacement = "High", whichbin1)
whichbin1<-gsub(pattern = "5", replacement = "VHigh", whichbin1)
train_non_zero_scaled[,4]<-whichbin1
#plot frequencies in the bins
barplot(table(train_non_zero_scaled[,4]))
N2 <- length(test_non_zero_scaled[,2])
nbins2 <- 5
whichbin2 <- c(rep(0, N2))
freq2 <- N2/nbins2
test_non_zero_scaled <- test_non_zero_scaled[order(test_non_zero_scaled[,2]),]
for (i in 1:nbins2) {
for (j in 1:N2) {
if((i-1)*freq2 < j && j <=i*freq2)
whichbin2[j] <- i
}
}
whichbin2<-gsub(pattern = "1", replacement = "VLow", whichbin2)
whichbin2<-gsub(pattern = "2", replacement = "Low", whichbin2)
whichbin2<-gsub(pattern = "3", replacement = "Medium", whichbin2)
whichbin2<-gsub(pattern = "4", replacement = "High", whichbin2)
whichbin2<-gsub(pattern = "5", replacement = "VHigh", whichbin2)
test_non_zero_scaled[,2]<-whichbin2
#plot frequencies in the bins
barplot(table(test_non_zero_scaled[,2]))
If necessary, we might want to consider binning based on predictive value or clarifying the best cut-off numbers with the client experts.
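As an aside, the same equal-frequency bins can be produced more compactly with quantile breaks and cut(), avoiding the nested loop. A self-contained sketch:

```r
vals <- c(5, 1, 9, 3, 7, 2, 8, 4, 6, 10)
bin_labels <- c("VLow", "Low", "Medium", "High", "VHigh")
# Breaks at the 0th, 20th, ..., 100th percentiles give equal-count bins
bins <- cut(vals, breaks = quantile(vals, probs = seq(0, 1, 0.2)),
            labels = bin_labels, include.lowest = TRUE)
table(bins)  # two observations per bin
```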
Also, as we plan to apply a classification decision tree, we have decided to try equal-frequency binning for the target variable.
#Binning Target Variable
Nt <- length(train_non_zero_scaled$log_target)
nbinst <- 5
whichbint <- c(rep(0, Nt))
freqt <- Nt/nbinst
train_non_zero_scaled_sorted <- train_non_zero_scaled[order(train_non_zero_scaled$log_target),]
for (i in 1:nbinst) {
for (j in 1:Nt) {
if((i-1)*freqt < j && j <=i*freqt)
whichbint[j] <- i
}
}
whichbint1<-gsub(pattern = "1", replacement = "VLow", whichbint)
whichbint2<-gsub(pattern = "2", replacement = "Low", whichbint1)
whichbint3<-gsub(pattern = "3", replacement = "Medium", whichbint2)
whichbint4<-gsub(pattern = "4", replacement = "High", whichbint3)
whichbint5<-gsub(pattern = "5", replacement = "VHigh", whichbint4)
train_non_zero_scaled_sorted$bin_target<-whichbint5
train_non_zero_scaled_sorted<-train_non_zero_scaled_sorted[c(1:3,44,4:43)]
#plot frequencies in the bins
barplot(table(train_non_zero_scaled_sorted$bin_target))
1a. Clustering
We start with unsupervised clustering to see if we can identify any patterns of similarity across data rows. We apply hierarchical agglomerative clustering using default complete linkage.
##calculate distance matrix (default is Euclidean distance)
distance = dist(train_non_zero_scaled[,5:42])
# Hierarchical agglomerative clustering using default complete linkage
train.hclust = hclust(distance)
plot(train.hclust)
member = cutree(train.hclust,3)
table(member)
## member
## 1 2 3
## 75 2 4
##calculate the same for the test subset
distance2 = dist(test_non_zero_scaled[,3:41])
train.hclust2 = hclust(distance2)
member2 = cutree(train.hclust2,3)
table(member2)
## member2
## 1 2 3
## 51 25 1
Clustering analysis on the entire standardized training subset did not yield any significant results: about 93% of the records (75 of 81) fall into a single cluster. As for the standardized test subset, the analysis shows a 51-25-1 split across 3 clusters. Yet, to maintain consistency in evaluating the models, we will proceed with treating each test subset record individually.
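By contrast, when genuine group structure exists, the same hclust/cutree pipeline recovers it cleanly, as this synthetic two-group example shows:

```r
set.seed(7)
# Ten points near (0, 0) and ten near (10, 10): two well-separated groups
pts <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
             matrix(rnorm(20, mean = 10), ncol = 2))
cl <- cutree(hclust(dist(pts)), k = 2)  # complete linkage by default
table(cl)  # an even 10-10 split
```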
1b. Linear regression
Next, we proceed to supervised methods to achieve the goal of predicting the value of future customer transactions.
Given the disguised nature of the columns, we start by applying stepwise linear regression (both directions, selecting by AIC) to the whole standardized training subset with the log-transformed target variable.
#*use train_non_zero_scaled on log_target*
full.model<-lm(log_target ~., data = train_non_zero_scaled[,3:43])
step.model <- stepAIC(full.model, direction = "both", trace = FALSE)
summary(step.model)
##
## Call:
## lm(formula = log_target ~ X20aa07010 + X963a49cdc + X26fc93eb7 +
## X0572565c2 + fb49e4212 + X6619d81fc + X1db387535 + b43a7cfd5 +
## X024c577b9 + X2ec5b290f + X0ff32eb98 + fb0f5dbfe + X91f701ba2 +
## X703885424 + eeb9cd3aa + X324921c7b + X58232a6fb + d6bb78916 +
## X15ace8c9f + X5c6487af1 + c5a231d81 + e176a204a + X1702b5bf0 +
## X190db8488 + c47340d97 + X23310aa6f, data = train_non_zero_scaled[,
## 3:43])
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9575153 -0.4854045 -0.0615563 0.4603415 3.2834165
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.3015273 0.3826437 37.37558 < 0.000000000000000222 ***
## X20aa07010Low -0.9272323 0.5597148 -1.65662 0.10373765
## X20aa07010Medium -0.4845062 0.5337094 -0.90781 0.36824890
## X20aa07010VHigh -0.4888379 0.5866908 -0.83321 0.40861033
## X20aa07010VLow 0.3442017 0.5447283 0.63188 0.53028918
## X963a49cdc -0.4133555 0.2636210 -1.56799 0.12306747
## X26fc93eb7 -0.4418430 0.2126401 -2.07789 0.04277142 *
## X0572565c2 0.4856014 0.3044659 1.59493 0.11690733
## fb49e4212 0.7229849 0.2333648 3.09809 0.00316500 **
## X6619d81fc 0.5531787 0.2761916 2.00288 0.05051916 .
## X1db387535 -0.7400279 0.3347031 -2.21100 0.03154598 *
## b43a7cfd5 0.7148540 0.2726452 2.62192 0.01149761 *
## X024c577b9 1.5820025 0.3248829 4.86946 0.000011239 ***
## X2ec5b290f 0.3013891 0.2088398 1.44316 0.15509013
## X0ff32eb98 -0.9261896 0.3159240 -2.93168 0.00503624 **
## fb0f5dbfe -0.4712750 0.2503986 -1.88210 0.06553437 .
## X91f701ba2 -0.3490632 0.2924326 -1.19365 0.23813646
## X703885424 0.6119260 0.3132018 1.95378 0.05622292 .
## eeb9cd3aa -1.2113018 0.3403336 -3.55916 0.00081531 ***
## X324921c7b 0.5611060 0.2313403 2.42546 0.01886749 *
## X58232a6fb 0.4494637 0.2218243 2.02621 0.04798842 *
## d6bb78916 0.3748001 0.2479296 1.51172 0.13677729
## X15ace8c9f 0.7253316 0.2248747 3.22549 0.00219722 **
## X5c6487af1 -0.5533078 0.2306333 -2.39908 0.02012951 *
## c5a231d81 -0.4154158 0.2234654 -1.85897 0.06880558 .
## e176a204a 0.3370184 0.2250149 1.49776 0.14035974
## X1702b5bf0 0.6240518 0.2024182 3.08298 0.00330322 **
## X190db8488 -0.6923377 0.2744539 -2.52260 0.01481109 *
## c47340d97 -1.1328816 0.3527651 -3.21143 0.00228839 **
## X23310aa6f 0.9149162 0.3612198 2.53285 0.01443283 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.282658 on 51 degrees of freedom
## Multiple R-squared: 0.5273541, Adjusted R-squared: 0.2585947
## F-statistic: 1.962179 on 29 and 51 DF, p-value: 0.017356
The final model is significant as a whole (F-test p-value 0.017), but it explains only 53% of the variation in the data, and the adjusted R-squared drops to 0.26 once the number of predictors is accounted for.
#Evaluate using the same training set; exp() back-transforms the log-scale predictions to dollar units
train_non_zero_scaled$pred_target_value_reg<-exp(predict(step.model,train_non_zero_scaled))
(rmse1a<-RMSE(train_non_zero_scaled$target, train_non_zero_scaled$pred_target_value_reg))
## [1] 4512033.709
(rmsle1a<-rmsle(preds = as.numeric(train_non_zero_scaled$pred_target_value_reg), actuals = as.numeric(train_non_zero_scaled$target)))
## [1] 1.017776982
#Evaluate using the held-out validation part of the training set
train_valid_scaled$pred_target_value_reg<-exp(predict(step.model,train_valid_scaled))
(rmse1b<-RMSE(train_valid_scaled$target, train_valid_scaled$pred_target_value_reg))
## [1] 317679997.4
(rmsle1b<-rmsle(preds = as.numeric(train_valid_scaled$pred_target_value_reg), actuals = as.numeric(train_valid_scaled$target)))
## [1] 3.06331573
test_non_zero_scaled$pred_target_value_reg<-exp(predict(step.model,test_non_zero_scaled))
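The rmsle() metric used above penalizes relative rather than absolute error, which suits a dollar-valued target spanning several orders of magnitude. Assuming the usual definition (as in the mltools package), it can be reproduced directly:

```r
# RMSLE from its definition: errors are measured on the log(1+x) scale,
# so predictions are penalized relative to the actual magnitude rather
# than in absolute dollars.
rmsle_manual <- function(preds, actuals) {
  sqrt(mean((log1p(preds) - log1p(actuals))^2))
}

# Overpredicting by a factor of two costs roughly log(2) at any scale:
rmsle_manual(200, 100)
rmsle_manual(2e6, 1e6)
```

This explains why the validation RMSE (in dollars) can explode while RMSLE grows much more modestly.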
1c. Decision Tree
#*use train_non_zero_scaled on log_target*
tree1<-rpart(log_target ~ ., data=train_non_zero_scaled[,3:43], method="anova")
#summary(tree1)
tmp<-printcp(tree1)
##
## Regression tree:
## rpart(formula = log_target ~ ., data = train_non_zero_scaled[,
## 3:43], method = "anova")
##
## Variables actually used in tree construction:
## [1] c5a231d81 X024c577b9 X15ace8c9f X20aa07010 X2ec5b290f X6eef030c1
## [7] X9fd594eec
##
## Root node error: 177.52364/81 = 2.1916498
##
## n= 81
##
## CP nsplit rel error xerror xstd
## 1 0.158266624 0 1.00000000 1.0227599 0.14014302
## 2 0.091892319 1 0.84173338 1.2558885 0.19220387
## 3 0.055443903 2 0.74984106 1.4600181 0.20127916
## 4 0.054609331 3 0.69439715 1.4594684 0.20254675
## 5 0.052129518 4 0.63978782 1.5022374 0.21021319
## 6 0.035757169 6 0.53552879 1.6453383 0.21738165
## 7 0.010000000 7 0.49977162 1.6920258 0.22398330
(rsq.val <- 1-tmp[,3])
## 1 2 3 4 5
## 0.0000000000 0.1582666245 0.2501589435 0.3056028465 0.3602121771
## 6 7
## 0.4644712136 0.5002283827
plot(tree1, uniform=TRUE,
main="Regression Tree for Continuous Normal Target")
text(tree1, use.n=TRUE, all=TRUE, cex=.65)
After the seventh split, this regression decision tree model still explains only 50% of the variation in the training data.
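Note also that in the CP table above the cross-validated error (xerror) grows with each split even as the training error falls, a sign of overfitting. A common remedy, sketched here on a built-in data set rather than the anonymized training data, is to prune back to the complexity parameter with the smallest cross-validated error:

```r
library(rpart)

# Fit a small regression tree on a built-in data set, then prune it back
# to the complexity parameter (CP) with the lowest cross-validated error.
set.seed(1)
fit <- rpart(mpg ~ ., data = mtcars, method = "anova")
cp.table <- fit$cptable
best.cp <- cp.table[which.min(cp.table[, "xerror"]), "CP"]
fit.pruned <- prune(fit, cp = best.cp)
```

Applied to tree1, this procedure would collapse the tree back toward the root, consistent with the xerror column peaking at zero splits.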
#*use train_non_zero_scaled_sorted on bin_target*
tree2<-rpart(bin_target ~ ., data=train_non_zero_scaled_sorted[,4:44], method="class")
#summary(tree2)
printcp(tree2)
##
## Classification tree:
## rpart(formula = bin_target ~ ., data = train_non_zero_scaled_sorted[,
## 4:44], method = "class")
##
## Variables actually used in tree construction:
## [1] f190486d6 f74e8f13d fb49e4212 X15ace8c9f X6eef030c1
##
## Root node error: 64/81 = 0.79012346
##
## n= 81
##
## CP nsplit rel error xerror xstd
## 1 0.140625 0 1.00000 1.218750 0.026557393
## 2 0.062500 2 0.71875 1.031250 0.054625461
## 3 0.031250 4 0.59375 1.000000 0.057265356
## 4 0.010000 5 0.56250 0.984375 0.058463397
plot(tree2, uniform=TRUE,
main="Classification Tree for Categorical Target")
text(tree2, use.n=TRUE, all=TRUE, cex=.65)
#evaluating the first tree model (continuous) using the same training set
train_non_zero_scaled$pred_target_value_tree1<-exp(predict(tree1, train_non_zero_scaled))
(rmse2a<-RMSE(train_non_zero_scaled$target, train_non_zero_scaled$pred_target_value_tree1))
## [1] 4567201.665
(rmsle2a<-rmsle(preds = as.numeric(train_non_zero_scaled$pred_target_value_tree1), actuals = as.numeric(train_non_zero_scaled$target)))
## [1] 1.046575204
#evaluating the first tree model (continuous) using the held-out validation set
train_valid_scaled$pred_target_value_tree1<-exp(predict(tree1,train_valid_scaled))
(rmse2b<-RMSE(train_valid_scaled$target, train_valid_scaled$pred_target_value_tree1))
## [1] 6318737.663
(rmsle2b<-rmsle(preds = as.numeric(train_valid_scaled$pred_target_value_tree1), actuals = as.numeric(train_valid_scaled$target)))
## [1] 2.637274385
#evaluating the second tree model (categorical)
train_non_zero_scaled_sorted$pred_target_value_tree2<-predict(tree2, train_non_zero_scaled_sorted, type = "class")
confusionMatrix(train_non_zero_scaled_sorted$pred_target_value_tree2,as.factor(train_non_zero_scaled_sorted$bin_target), dnn = c("Prediction", "True Value"))
## Confusion Matrix and Statistics
##
## True Value
## Prediction High Low Medium VHigh VLow
## High 10 0 2 4 2
## Low 1 11 9 2 1
## Medium 0 2 3 1 1
## VHigh 0 2 1 9 0
## VLow 5 1 1 1 12
##
## Overall Statistics
##
## Accuracy : 0.5555556
## 95% CI : (0.4408856, 0.6660422)
## No Information Rate : 0.2098765
## P-Value [Acc > NIR] : 0.00000000001047319
##
## Kappa : 0.4448886
## Mcnemar's Test P-Value : 0.1851722
##
## Statistics by Class:
##
## Class: High Class: Low Class: Medium Class: VHigh
## Sensitivity 0.6250000 0.6875000 0.18750000 0.5294118
## Specificity 0.8769231 0.8000000 0.93846154 0.9531250
## Pos Pred Value 0.5555556 0.4583333 0.42857143 0.7500000
## Neg Pred Value 0.9047619 0.9122807 0.82432432 0.8840580
## Prevalence 0.1975309 0.1975309 0.19753086 0.2098765
## Detection Rate 0.1234568 0.1358025 0.03703704 0.1111111
## Detection Prevalence 0.2222222 0.2962963 0.08641975 0.1481481
## Balanced Accuracy 0.7509615 0.7437500 0.56298077 0.7412684
## Class: VLow
## Sensitivity 0.7500000
## Specificity 0.8769231
## Pos Pred Value 0.6000000
## Neg Pred Value 0.9344262
## Prevalence 0.1975309
## Detection Rate 0.1481481
## Detection Prevalence 0.2469136
## Balanced Accuracy 0.8134615
test_non_zero_scaled$pred_target_value_tree<-exp(predict(tree1,test_non_zero_scaled))
1d. Association Rules
To start off with this approach, we create a subset of the training data that includes only some of the variables marked as significant during the regression analysis. After making every column in the subset categorical, we generate association rules for the variable “target” at 3.5% support and 75% confidence, with rule length between 3 and 5 items.
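The construction of train_non_zero_scaled_sorted_ruled is not shown; as a sketch, the "make every column categorical" step might look like the helper below (the function name and the five-level quantile scheme are our assumptions, chosen to match the VLow–VHigh labels that appear in the rules):

```r
# Discretize every numeric column of a data frame into five
# quantile-based levels (a sketch of the unseen preprocessing step;
# cut() would fail on columns whose quantiles are not distinct, so
# near-constant columns would first need to be screened out).
labs <- c("VLow", "Low", "Medium", "High", "VHigh")
to_five_levels <- function(df) {
  df[] <- lapply(df, function(col) {
    if (!is.numeric(col)) return(col)
    cut(col, breaks = quantile(col, probs = seq(0, 1, 0.2)),
        labels = labs, include.lowest = TRUE)
  })
  df
}

# Hypothetical usage, mirroring the description above:
# ruled <- to_five_levels(train_non_zero_scaled_sorted[, sig_vars])
```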
#*use train_non_zero_scaled_sorted_ruled on target*
train_non_zero_scaled_sorted_ruled <- as(train_non_zero_scaled_sorted_ruled, "transactions")
target_rules <- apriori(data=train_non_zero_scaled_sorted_ruled, parameter=list (supp=0.035,conf = 0.75, minlen=3, maxlen=5), appearance = list (rhs=c("target=VLow", "target=Low", "target=Medium", "target=High", "target=VHigh")))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.75 0.1 1 none FALSE TRUE 5 0.035 3
## maxlen target ext
## 5 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 2
##
## set item appearances ...[5 item(s)] done [0.00s].
## set transactions ...[35 item(s), 81 transaction(s)] done [0.00s].
## sorting and recoding items ... [35 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [21 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
inspect(target_rules[1:10])
## lhs rhs support confidence lift count
## [1] {X20aa07010=VHigh,
## X024c577b9=VLow} => {target=Low} 0.03703703704 0.75 3.796875000 3
## [2] {X26fc93eb7=Low,
## X6619d81fc=Low} => {target=Medium} 0.03703703704 0.75 3.796875000 3
## [3] {X26fc93eb7=Low,
## X024c577b9=Medium} => {target=Medium} 0.03703703704 0.75 3.796875000 3
## [4] {X024c577b9=Medium,
## X1931ccfdd=High} => {target=Medium} 0.03703703704 1.00 5.062500000 3
## [5] {X20aa07010=High,
## X024c577b9=High} => {target=VHigh} 0.03703703704 0.75 3.573529412 3
## [6] {X26fc93eb7=High,
## X6619d81fc=VHigh} => {target=VHigh} 0.03703703704 1.00 4.764705882 3
## [7] {X26fc93eb7=High,
## X66ace2992=High} => {target=VHigh} 0.03703703704 1.00 4.764705882 3
## [8] {X26fc93eb7=Low,
## X66ace2992=Low,
## b43a7cfd5=Low} => {target=Low} 0.03703703704 0.75 3.796875000 3
## [9] {X26fc93eb7=Low,
## b43a7cfd5=Low,
## X1931ccfdd=Low} => {target=Low} 0.03703703704 0.75 3.796875000 3
## [10] {X20aa07010=VHigh,
## X6619d81fc=Medium,
## X024c577b9=VLow} => {target=Low} 0.03703703704 0.75 3.796875000 3
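As a sanity check on this output, lift is simply confidence divided by the support of the right-hand side, and the reported values can be reproduced from the class counts (the confusion-matrix prevalences give 16 of 81 rows labelled Low or Medium and 17 of 81 labelled VHigh):

```r
# lift = confidence / support(rhs), with 81 transactions in total.
conf <- 0.75
conf / (16 / 81)  # lift for a target=Low or target=Medium rule: 3.796875
conf / (17 / 81)  # lift for a target=VHigh rule: 3.573529...
```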
target_rules<-sort(target_rules, by="confidence", decreasing=TRUE)
#plot(target_rules[1:10], measure = "support", method="graph", engine='interactive', shading="confidence")
plot(target_rules[1:10], measure = "support", method="graph", shading="confidence")
1e. Neural Networks
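No network was fit in this analysis; as a starting sketch, a single-hidden-layer regression network could be trained on the log target with the nnet package (which ships with R). The data frame below is synthetic stand-in data, and the size and decay values are illustrative choices only; on the real data, train_non_zero_scaled[,3:43] would take its place.

```r
library(nnet)

# Minimal single-hidden-layer regression network on synthetic stand-in
# data (5 standardized predictors, 81 rows, log-scale target).
set.seed(42)
d <- data.frame(matrix(rnorm(81 * 5), ncol = 5))
d$log_target <- rowSums(d[, 1:5]) + rnorm(81, sd = 0.1)

fit <- nnet(log_target ~ ., data = d, size = 4, linout = TRUE,
            decay = 0.01, maxit = 500, trace = FALSE)

# As with the other models, back-transform predictions to dollar units:
pred_dollars <- exp(predict(fit, d))
```

linout = TRUE is essential here: without it, nnet applies a logistic output unit and cannot fit an unbounded regression target.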
This model remains to be explored. When doing so, remember to recover dollar-unit measures to compensate for the log transformation of the target, and to compile a statistical summary of each model's performance so the best one can be chosen for recommendation.
Steps going forward and additional research questions: * Obtain feedback on the actual test RMSE to support the iterative approach. * Seek clarification on the anonymized variables - demystify them.